In this practical you will be answering a research question or solving a problem. For that you will create a pipeline for classification or clustering.

All the data has already been processed and can be found in the GitHub repository.

Here are some proposed research questions:

Classification:

RQ1: Identification of fake news, hate speech or spam + Interpretability of results:

RQ2: Evaluate the importance of metadata. Create a classification system to identify the movie genre using and excluding metadata:

Clustering:

RQ3: Create a recommendation system for movies based on their plot:

RQ4: Cluster headlines using word embeddings:

You can come up with your own research question using any dataset on text analysis, e.g. from:

RQ1: Identification of hate speech

We provide code for the first dataset. Your goal is to (1) improve the classifier by using a more advanced method (2)

Data: Dataset of hate speech annotated at sentence level on English-language Internet forum posts. The source forum is Stormfront, a large online community of white nationalists. A total of 10,568 sentences have been extracted from Stormfront and classified as conveying hate speech or not.

Step 1: Read data and create train-test split
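
As a minimal sketch of this step, the snippet below reads a small table and makes a stratified train-test split. The in-memory CSV, the column names `text` and `label`, and the 75/25 split are assumptions for illustration only; replace them with the actual file and columns from the repository.

```python
import io

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real file: a CSV with a "text" column and a
# binary "label" column (1 = hate speech, 0 = not). Both names are assumptions.
csv_data = io.StringIO(
    "text,label\n"
    "thanks for the helpful answer,0\n"
    "thanks for sharing this recipe,0\n"
    "many thanks for your kind help,0\n"
    "thanks everyone have a good evening,0\n"
    "those people are awful and should leave,1\n"
    "awful people like them ruin everything,1\n"
    "they are awful and nobody wants them here,1\n"
    "such awful people do not belong here,1\n"
)
df = pd.read_csv(csv_data)

# Stratify on the label so both classes appear in train and test
# in the same proportion as in the full dataset.
X_train, X_test, y_train, y_test = train_test_split(
    df["text"],
    df["label"],
    test_size=0.25,
    stratify=df["label"],
    random_state=42,
)

print(len(X_train), "training sentences,", len(X_test), "test sentences")
```

Stratifying matters for this kind of task because the classes are usually imbalanced, and an unstratified split can leave very few positive examples in the test set.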

Step 2: Create pipeline and hyperparameter tuning

Create a pipeline that vectorizes the text, transforms it using TF-IDF, and classifies the sentences using LogisticRegression.
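
One possible shape for such a pipeline, with a small grid search over its hyperparameters, is sketched below. The toy sentences, the step names (`vect`, `tfidf`, `clf`), the parameter grid and the F1 scoring are all assumptions chosen for illustration, not the values used in the provided code.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data standing in for the training split (1 = hate speech, 0 = not).
texts = [
    "thanks for the helpful answer",
    "thanks for sharing this recipe",
    "many thanks for your kind help",
    "thanks everyone have a good evening",
    "what a lovely sunny morning",
    "good luck with your exam",
    "those people are awful and should leave",
    "awful people like them ruin everything",
    "they are awful and nobody wants them here",
    "such awful people do not belong here",
    "nobody likes those awful people",
    "they should all leave now",
]
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

# Counts -> TF-IDF weights -> linear classifier.
pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters are addressed as "<step name>__<parameter name>".
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": [True, False],
    "clf__C": [0.1, 1.0, 10.0],
}

# 3-fold cross-validation over every combination in the grid.
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="f1")
search.fit(texts, labels)

print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", round(search.best_score_, 3))
```

Because the vectorizer sits inside the pipeline, it is refitted on the training folds only during cross-validation, which avoids leaking vocabulary statistics from the validation fold.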

Step 3: Interpretation of results

Interpretation of coefficients in the linear model

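We can use the coefficients of the logistic regression: each word has one coefficient, and its sign and magnitude show which class the word pushes the prediction towards and how strongly.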

Interpretation of coefficients using LIME (Local Interpretable Model-Agnostic Explanations)

LIME modifies the text (for example, by removing words) to understand the impact of each word on the predictions.
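
To make the idea concrete, here is a simplified sketch of the perturbation approach written with scikit-learn only. It is not the `lime` package (which offers `LimeTextExplainer` for this purpose); the function name `explain_text`, the kernel width, the number of samples and the toy data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.pipeline import make_pipeline

# Toy classifier to explain (1 = hate speech, 0 = not).
texts = [
    "thanks for the helpful answer",
    "thanks for sharing this recipe",
    "many thanks for your kind help",
    "thanks everyone have a good evening",
    "those people are awful and should leave",
    "awful people like them ruin everything",
    "they are awful and nobody wants them here",
    "such awful people do not belong here",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)


def explain_text(text, predict_proba, n_samples=500, seed=0):
    """Hypothetical helper: LIME-style word weights for one text."""
    rng = np.random.default_rng(seed)
    words = text.split()
    # Each row is a binary mask over the words: 1 keeps a word, 0 removes it.
    masks = rng.integers(0, 2, size=(n_samples, len(words)))
    masks[0] = 1  # the first sample is the unmodified text
    perturbed = [
        " ".join(w for w, keep in zip(words, row) if keep) for row in masks
    ]
    # Probability of class 1 for every perturbed version of the text.
    probs = predict_proba(perturbed)[:, 1]
    # Samples closer to the original text (more words kept) get more weight.
    removed_fraction = 1.0 - masks.mean(axis=1)
    weights = np.exp(-(removed_fraction ** 2) / 0.25)
    # Local surrogate: a weighted linear model from masks to probabilities.
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(masks, probs, sample_weight=weights)
    return sorted(
        zip(words, surrogate.coef_), key=lambda pair: abs(pair[1]), reverse=True
    )


explanation = explain_text("thanks but that was awful", model.predict_proba)
for word, weight in explanation:
    print(f"{word:>8s}  {weight:+.3f}")
```

A positive weight means that keeping the word raises the predicted probability of class 1 for this particular sentence; unlike the global coefficients of the linear model, these weights are local to the sentence being explained and work for any classifier that exposes `predict_proba`.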

Now it's your turn.

Either: